Variant Discovery ◾ 135
RGPL: This argument sets the RG field that holds the name of the sequencing
platform. The value can be ILLUMINA, SOLID, LS454, HELICOS, or PACBIO.
RGLB: This argument sets the RG field that holds the DNA preparation library
identifier. The “MarkDuplicates” function of GATK uses this field to determine which
RGs might contain duplicates, for example when the same library was sequenced on
multiple lanes.
The following script creates the directory “RG” and uses Picard to add an RG to each
BAM file in the “dedup” directory. We use the run ID as both the RG identifier (RGID)
and the sample name (RGSM).
mkdir -p RG
cd dedup
# Loop over the BAM files; ${f%.bam} strips the extension to give the run ID
for f in *.bam;
do
  i=${f%.bam}
  java -jar ~/software/picard.jar AddOrReplaceReadGroups \
    I="${i}.bam" \
    O="../RG/${i}.RG.bam" \
    RGID="${i}" \
    RGLB=lib RGPL=ILLUMINA \
    SORT_ORDER=coordinate \
    RGPU=bar1 RGSM="${i}"
  # Index the new BAM file
  samtools index "../RG/${i}.RG.bam"
done
cd ..
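After adding the RGs, it can be useful to confirm that the header of each new BAM file
carries the expected values. The following is a minimal sketch, not part of the
original workflow: the helper name rg_field and the file name sample1.RG.bam are
hypothetical. It reads a SAM header from standard input and prints the value of one RG
tag for every @RG line.

```shell
# rg_field TAG
# Hypothetical helper: reads a SAM header on standard input and prints the
# value of the given read-group tag (e.g. ID, SM, PL) for each @RG line.
rg_field() {
    awk -F'\t' -v tag="$1" '
        $1 == "@RG" {
            # Header fields look like TAG:value; match on the "TAG:" prefix
            for (n = 2; n <= NF; n++)
                if (index($n, tag ":") == 1)
                    print substr($n, length(tag) + 2)
        }'
}

# Typical use, assuming samtools is installed and the file exists:
#   samtools view -H RG/sample1.RG.bam | rg_field SM
```

If the script above ran as intended, the SM and ID values printed for a file should
both equal its run ID, and PL should be ILLUMINA.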
4.2.2.2.9 Building a model for the BQSR
We already know that raw sequencing data may contain systematic errors that bias the
reported base quality scores, so that the reported scores are either overestimated or
underestimated. The quality of variant calling depends largely on the base quality
scores, which also affect read alignment. To minimize the effect of systematic errors
on variant calling, the GATK4 best practices include base quality score recalibration
(BQSR). BQSR is a machine learning-based method that uses training data to model the
empirically observed errors and then adjusts the quality scores of the aligned reads
using that model. The variant caller then uses the adjusted scores when deciding
whether to call a variant. BQSR is carried out in two steps: (i) building the
recalibration table, using a set of known variants as a training dataset (with the
BaseRecalibrator GATK4 function), and (ii) adjusting the base quality scores (with the
ApplyBQSR GATK4 function). The first step generates a table that summarizes, for
groups of bases sharing properties such as read group, reported quality score, machine
cycle, and sequence context, how far the observed error rate departs from the reported
quality. The second step applies the recalibration, that is, it adjusts the quality
scores accordingly. BQSR produces a new BAM file with recalibrated quality scores that
the variant calling process can rely on. The known variants are used to mask the sites
of real variation, so that true variants at those sites are not mistaken for sequencing
errors. Model training requires high-quality variant datasets (SNPs and InDels) in VCF
files downloaded from a reliable source such as the NCBI database. Human variant VCF
files can also be downloaded from the GATK resource bundle, as mentioned above.
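The two steps described above can be sketched as a pair of GATK4 commands. The
following is only an illustration under stated assumptions: the reference path, the
known-sites VCF path, the output directory “BQSR”, and the helper name bqsr_commands
are all hypothetical, and the function only prints the commands (a dry run) so that
they can be inspected before anything is executed.

```shell
# Hypothetical paths; replace them with your own reference and known-sites files.
REF=${REF:-ref/genome.fasta}
KNOWN=${KNOWN:-ref/known_sites.vcf.gz}

# bqsr_commands BAM
# Hypothetical helper: prints (does not run) the two GATK4 commands for one BAM file.
bqsr_commands() {
    in_bam=$1
    base=$(basename "$in_bam" .bam)
    # Step (i): build the recalibration table from the known variants
    echo "gatk BaseRecalibrator -I $in_bam -R $REF --known-sites $KNOWN -O BQSR/$base.table"
    # Step (ii): apply the table and write a BAM file with recalibrated scores
    echo "gatk ApplyBQSR -I $in_bam -R $REF --bqsr-recal-file BQSR/$base.table -O BQSR/$base.bqsr.bam"
}

# Dry run for one (hypothetical) input file
bqsr_commands RG/sample1.RG.bam
```

Once the printed commands look right, they can be run by piping the output to sh,
provided that the “BQSR” directory exists, the reference has its index and sequence
dictionary, and the known-sites VCF file is indexed.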